Contextual Sparsity with Correction for Efficient LLMs Yang Zhou
With the rise of large language models (LLMs), inference efficiency has become increasingly important. Various approximation methods have been proposed to reduce inference-time cost. Contextual Sparsity (CS) is appealing because it is training-free and can reach higher compression ratios seemingly without significant performance degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks.